Fixes fragment loading to properly pack 16b values into a 32b register#2
Fixes fragment loading to properly pack 16b values into a 32b register#2diptorupd merged 4 commits intoamd-integrationfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a critical endianness bug in the HIP MMA fragment loading functions. The implementation was incorrectly packing 16-bit half values into 32-bit registers in big-endian format instead of the little-endian format required by CDNA3 ISA.
- Fixed
load_fragmentfunction to use direct memory access instead of bit shifting - Corrected
transpose_4x4_half_registersfunction to work with the fixed endianness - Removed obsolete
load_fragment_transposefunction and replaced its usage with proper transposition
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| libflashinfer/include/gpu_iface/backend/hip/mma_hip.h | Core fix for endianness bug in fragment loading and transpose operations |
| libflashinfer/include/gpu_iface/mma_ops.hpp | Removed obsolete transpose function and fixed template parameter |
| libflashinfer/tests/hip/test_mfma_fp32_16x16x16fp16.cpp | Updated test to use corrected functions and improved indexing logic |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| template <typename T> | ||
| __device__ __forceinline__ void load_fragment_transpose_4x4_half_registers(uint32_t* R, | ||
| const T* smem_ptr) { | ||
| static_assert(std::is_same<T, int>::value, | ||
| static_assert(std::is_same<T, __half>::value, | ||
| "Only __half is supported for the 4x4 register transpose"); |
There was a problem hiding this comment.
The static_assert checks for type T but the function call always uses __half. This creates a mismatch - either the template parameter T should be used in the function call, or the static_assert should be removed since T is not used.
| __device__ __forceinline__ void load_fragment_4x4_half_registers(uint32_t* R, const T* smem_ptr) { | ||
| static_assert(std::is_same_v<T, __half>(), "Only half type is supported"); | ||
| // Each thread loads 4 __half values in two 32b registers. | ||
| static_assert(std::is_same_v<T, __half>, "Only half type is supported"); |
There was a problem hiding this comment.
The static_assert uses std::is_same_v which requires C++17, but the codebase appears to use C++14/C++11 style elsewhere (e.g., std::is_same<T, __half>::value in the other file). This should be consistent with the project's C++ standard.
| static_assert(std::is_same_v<T, __half>, "Only half type is supported"); | |
| static_assert(std::is_same<T, __half>::value, "Only half type is supported"); |
rtmadduri
left a comment
There was a problem hiding this comment.
All these changes look good to me.
This PR introduces a patch to includes
Tested with Unit Tests:
Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/4 Test #1: MathTest ......................... Passed 3.25 sec
Start 2: PosEncTest
2/4 Test #2: PosEncTest ....................... Passed 3.25 sec
Start 3: CascadeTest
3/4 Test #3: CascadeTest ...................... Passed 3.24 sec
Start 4: PageTest
4/4 Test #4: PageTest ......................... Passed 161.15 sec
100% tests passed, 0 tests failed out of 4
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers.
The clang format of headers causes compilation issues. The compiler is
unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling
clang format fixes these issues
```
Start 1: MathTest
1/6 Test #1: MathTest ......................... Passed 3.31 sec
Start 2: PosEncTest
2/6 Test #2: PosEncTest ....................... Passed 3.36 sec
Start 3: CascadeTest
3/6 Test #3: CascadeTest ...................... Passed 3.35 sec
Start 4: PageTest
4/6 Test #4: PageTest ......................... Passed 114.08 sec
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.22 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 559.75 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 719.07 sec
```
CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface`
This PR has been tested locally
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/6 Test #1: MathTest ......................... Passed 3.40 sec
Start 2: PosEncTest
2/6 Test #2: PosEncTest ....................... Passed 3.40 sec
Start 3: CascadeTest
3/6 Test #3: CascadeTest ...................... Passed 985.27 sec
Start 4: PageTest
4/6 Test #4: PageTest ......................... Passed 112.40 sec
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.46 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 556.81 sec
100% tests passed, 0 tests failed out of 6
```
To replicate the tests
```
cd flashinfer/libflashinfer/tests/hip
```
```
mkdir build && cd build/
```
```
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
```
```
make
```
```
ctest
```
In this PR I remove the `libtorch` dependency and removed
`test_page.cpp`. `test_page.cpp` is the only unit test that uses
libtorch. However, we also have a pytest for testing page. We will use
that for validation.
Removing the libtorch dependency will help us speed docker builds and
remove additional dependencies.
```Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/8 Test #1: MathTest ............................ Passed 0.31 sec
Start 2: PosEncTest
2/8 Test #2: PosEncTest .......................... Passed 0.31 sec
Start 3: CascadeTest
3/8 Test #3: CascadeTest ......................... Passed 1369.12 sec
Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest .................... Passed 7726.35 sec
Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest ..................... Passed 811.61 sec
Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec
Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ... Passed 0.28 sec
Start 8: test_rowsum
8/8 Test #8: test_rowsum ......................... Passed 0.27 sec
100% tests passed, 0 tests failed out of 8
```
#2) * Fix load_fragment * Fix Array OOO access in debug function * Fixes based on Copilot review. * More Copilot review based fixes
This PR introduces a patch to includes
Tested with Unit Tests:
Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/4 Test ROCm#1: MathTest ......................... Passed 3.25 sec
Start 2: PosEncTest
2/4 Test ROCm#2: PosEncTest ....................... Passed 3.25 sec
Start 3: CascadeTest
3/4 Test ROCm#3: CascadeTest ...................... Passed 3.24 sec
Start 4: PageTest
4/4 Test ROCm#4: PageTest ......................... Passed 161.15 sec
100% tests passed, 0 tests failed out of 4
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers.
The clang format of headers causes compilation issues. The compiler is
unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling
clang format fixes these issues
```
Start 1: MathTest
1/6 Test ROCm#1: MathTest ......................... Passed 3.31 sec
Start 2: PosEncTest
2/6 Test ROCm#2: PosEncTest ....................... Passed 3.36 sec
Start 3: CascadeTest
3/6 Test ROCm#3: CascadeTest ...................... Passed 3.35 sec
Start 4: PageTest
4/6 Test ROCm#4: PageTest ......................... Passed 114.08 sec
Start 5: SingleDecodeTest
5/6 Test ROCm#5: SingleDecodeTest ................. Passed 35.22 sec
Start 6: BatchDecodeTest
6/6 Test ROCm#6: BatchDecodeTest .................. Passed 559.75 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 719.07 sec
```
CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface`
This PR has been tested locally
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/6 Test ROCm#1: MathTest ......................... Passed 3.40 sec
Start 2: PosEncTest
2/6 Test ROCm#2: PosEncTest ....................... Passed 3.40 sec
Start 3: CascadeTest
3/6 Test ROCm#3: CascadeTest ...................... Passed 985.27 sec
Start 4: PageTest
4/6 Test ROCm#4: PageTest ......................... Passed 112.40 sec
Start 5: SingleDecodeTest
5/6 Test ROCm#5: SingleDecodeTest ................. Passed 35.46 sec
Start 6: BatchDecodeTest
6/6 Test ROCm#6: BatchDecodeTest .................. Passed 556.81 sec
100% tests passed, 0 tests failed out of 6
```
To replicate the tests
```
cd flashinfer/libflashinfer/tests/hip
```
```
mkdir build && cd build/
```
```
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
```
```
make
```
```
ctest
```
In this PR I remove the `libtorch` dependency and removed
`test_page.cpp`. `test_page.cpp` is the only unit test that uses
libtorch. However, we also have a pytest for testing page. We will use
that for validation.
Removing the libtorch dependency will help us speed docker builds and
remove additional dependencies.
```Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/8 Test ROCm#1: MathTest ............................ Passed 0.31 sec
Start 2: PosEncTest
2/8 Test ROCm#2: PosEncTest .......................... Passed 0.31 sec
Start 3: CascadeTest
3/8 Test ROCm#3: CascadeTest ......................... Passed 1369.12 sec
Start 4: SingleDecodeTest
4/8 Test ROCm#4: SingleDecodeTest .................... Passed 7726.35 sec
Start 5: BatchDecodeTest
5/8 Test ROCm#5: BatchDecodeTest ..................... Passed 811.61 sec
Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test ROCm#6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec
Start 7: test_transpose_4x4_half_registers
7/8 Test ROCm#7: test_transpose_4x4_half_registers ... Passed 0.28 sec
Start 8: test_rowsum
8/8 Test ROCm#8: test_rowsum ......................... Passed 0.27 sec
100% tests passed, 0 tests failed out of 8
```
ROCm#2) * Fix load_fragment * Fix Array OOO access in debug function * Fixes based on Copilot review. * More Copilot review based fixes
This PR introduces a patch to includes
Tested with Unit Tests:
Test project /root/amd_eng/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/4 Test #1: MathTest ......................... Passed 3.25 sec
Start 2: PosEncTest
2/4 Test #2: PosEncTest ....................... Passed 3.25 sec
Start 3: CascadeTest
3/4 Test #3: CascadeTest ...................... Passed 3.24 sec
Start 4: PageTest
4/4 Test #4: PageTest ......................... Passed 161.15 sec
100% tests passed, 0 tests failed out of 4
This PR fixes some of the unit test failures that occur in Single
Decode. It also disables clang formatting of headers.
The clang format of headers causes compilation issues. The compiler is
unable to find `HIP WARP SYNC INTRINSICS` causing failures. Disabling
clang format fixes these issues
```
Start 1: MathTest
1/6 Test #1: MathTest ......................... Passed 3.31 sec
Start 2: PosEncTest
2/6 Test #2: PosEncTest ....................... Passed 3.36 sec
Start 3: CascadeTest
3/6 Test #3: CascadeTest ...................... Passed 3.35 sec
Start 4: PageTest
4/6 Test #4: PageTest ......................... Passed 114.08 sec
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.22 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 559.75 sec
100% tests passed, 0 tests failed out of 6
Total Test time (real) = 719.07 sec
```
CPP test suite was using `hipified` headers. In this PR, we port over unit tests to use `gpu_iface`. This is necessary for us as the next step is to move the build infrastructure to use `gpu_iface`
This PR has been tested locally
```
Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/6 Test #1: MathTest ......................... Passed 3.40 sec
Start 2: PosEncTest
2/6 Test #2: PosEncTest ....................... Passed 3.40 sec
Start 3: CascadeTest
3/6 Test #3: CascadeTest ...................... Passed 985.27 sec
Start 4: PageTest
4/6 Test #4: PageTest ......................... Passed 112.40 sec
Start 5: SingleDecodeTest
5/6 Test #5: SingleDecodeTest ................. Passed 35.46 sec
Start 6: BatchDecodeTest
6/6 Test #6: BatchDecodeTest .................. Passed 556.81 sec
100% tests passed, 0 tests failed out of 6
```
To replicate the tests
```
cd flashinfer/libflashinfer/tests/hip
```
```
mkdir build && cd build/
```
```
cmake -DCMAKE_PREFIX_PATH=/root/libtorch -DCMAKE_CXX_COMPILER:PATH=/opt/rocm/bin/amdclang++ -DFLASHINFER_INCLUDE_DIRS=/root/flashinfer/libflashinfer/include/ ..
```
```
make
```
```
ctest
```
In this PR I remove the `libtorch` dependency and removed
`test_page.cpp`. `test_page.cpp` is the only unit test that uses
libtorch. However, we also have a pytest for testing page. We will use
that for validation.
Removing the libtorch dependency will help us speed docker builds and
remove additional dependencies.
```Test project /root/flashinfer/libflashinfer/tests/hip/build
Start 1: MathTest
1/8 Test #1: MathTest ............................ Passed 0.31 sec
Start 2: PosEncTest
2/8 Test #2: PosEncTest .......................... Passed 0.31 sec
Start 3: CascadeTest
3/8 Test #3: CascadeTest ......................... Passed 1369.12 sec
Start 4: SingleDecodeTest
4/8 Test #4: SingleDecodeTest .................... Passed 7726.35 sec
Start 5: BatchDecodeTest
5/8 Test #5: BatchDecodeTest ..................... Passed 811.61 sec
Start 6: test_mfma_fp32_16x16x16fp16
6/8 Test #6: test_mfma_fp32_16x16x16fp16 ......... Passed 0.30 sec
Start 7: test_transpose_4x4_half_registers
7/8 Test #7: test_transpose_4x4_half_registers ... Passed 0.28 sec
Start 8: test_rowsum
8/8 Test #8: test_rowsum ......................... Passed 0.27 sec
100% tests passed, 0 tests failed out of 8
```
#2) * Fix load_fragment * Fix Array OOO access in debug function * Fixes based on Copilot review. * More Copilot review based fixes
The PR fixes a bug in the HIP MMA load_fragment function. The existing implementation was incorrectly packing two 16b half values into a 32b register in a Big endian format. Since CDNA3 is a little endian ISA, it was causing incorrect values to be populated in the registers.
The PR:
load_fragmentfunction and simplifies the logic to not require bitwise shifts.transpose_4x4_half_registersfunction that was implemented to work based on the incorrect logic.load_fragment_transposefunction as it is not required.transpose_4x4_half_registersfunction.PS: The
test_mfma_fp32_16x16x16fp16.cppalso implements the same compute as needed incompute_sfm_vand correctly implements the traversal logic for the B-matrix.